The objective of this analytical report is to help companies identify good employees who are at risk of leaving the company. With this information, companies can allocate their finances and resources to areas that help retain good employees.
First, we will analyze and visualize the data to get a basic understanding of the data at hand (Human Resources Analytics by Ludovic Benistant from kaggle.com). After obtaining a basic understanding of the data, we will check the correlations among the factors to identify and interpret the key factors that drive employees to leave.
Second, we will segment all employees using cluster analysis to observe which clusters of employees have a higher possibility of leaving.
Finally, we will bucket the employees (excluding the ones who have stayed) across two dimensions, performance and risk of leaving, in order to predict and identify the employees companies generally wish to retain even at a higher cost: high-performing employees with a high risk of leaving (and perhaps also the low-performing employees with a low possibility of leaving). This will help companies target their human-resources investments and reduce the risk and negative impact of losing high-performing employees.
1.1 Load and Explore the data
First, let’s load the data to use.
# Load the HR dataset and convert all columns to a numeric matrix
ProjectData <- read.csv("./data/HR_data.csv")
ProjectData <- data.matrix(ProjectData)
Description of the data
Here is what the first 10 observations (employees) look like.
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.38 | 0.80 | 0.11 | 0.72 | 0.37 | 0.41 | 0.10 | 0.92 | 0.89 | 0.42 |
| last_evaluation | 0.53 | 0.86 | 0.88 | 0.87 | 0.52 | 0.50 | 0.77 | 0.85 | 1.00 | 0.53 |
| number_project | 2.00 | 5.00 | 7.00 | 5.00 | 2.00 | 2.00 | 6.00 | 5.00 | 5.00 | 2.00 |
| average_montly_hours | 157.00 | 262.00 | 272.00 | 223.00 | 159.00 | 153.00 | 247.00 | 259.00 | 224.00 | 142.00 |
| time_spend_company | 3.00 | 6.00 | 4.00 | 5.00 | 3.00 | 3.00 | 4.00 | 5.00 | 5.00 | 3.00 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| left | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| salary_level | 1.00 | 2.00 | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| sales | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| accounting | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| hr | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| technical | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| support | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| management | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IT | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| marketing | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| RandD | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
The data we use here have the following descriptive statistics.
| | min | 25th percentile | median | mean | 75th percentile | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.09 | 0.44 | 0.64 | 0.61 | 0.82 | 1 | 0.25 |
| last_evaluation | 0.36 | 0.56 | 0.72 | 0.72 | 0.87 | 1 | 0.17 |
| number_project | 2.00 | 3.00 | 4.00 | 3.80 | 5.00 | 7 | 1.23 |
| average_montly_hours | 96.00 | 156.00 | 200.00 | 201.05 | 245.00 | 310 | 49.94 |
| time_spend_company | 2.00 | 3.00 | 3.00 | 3.50 | 4.00 | 10 | 1.46 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 1.00 | 1.00 | 2.00 | 1.59 | 2.00 | 3 | 0.64 |
| sales | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
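The per-column statistics above can be computed with a small helper applied column-wise via `apply(ProjectData, 2, ...)`. Below is a minimal sketch; `describe` is a hypothetical helper name, and the input values are the first five `satisfaction_level` entries shown earlier, used purely for illustration.

```r
# Sketch of the per-column summary used for the table above.
# In the report it would be applied column-wise, e.g.
# round(t(apply(ProjectData, 2, describe)), 2).
describe <- function(x) {
  round(c(min    = min(x),
          q25    = unname(quantile(x, 0.25)),
          median = median(x),
          mean   = mean(x),
          q75    = unname(quantile(x, 0.75)),
          max    = max(x),
          std    = sd(x)), 2)
}
describe(c(0.38, 0.80, 0.11, 0.72, 0.37))  # toy satisfaction_level values
```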
1.2 Scale the data
Here, we normalize the data to the range 0 to 1 (min-max scaling) so that the results are not driven by a few variables with relatively large values.
# Min-max scale each column to the [0, 1] range
ProjectData_scaled <- apply(ProjectData, 2, function(r) {
  (r - min(r)) / (max(r) - min(r))
})
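As a quick sanity check, the same min-max transform applied to a toy matrix (illustrative numbers and column names, not the real dataset) maps every column onto [0, 1]:

```r
# Toy demonstration of the min-max scaling used above.
toy <- cbind(hours = c(96, 200, 310), projects = c(2, 4, 7))
toy_scaled <- apply(toy, 2, function(r) (r - min(r)) / (max(r) - min(r)))
toy_scaled  # every column now has min 0 and max 1
```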
Below are the summary statistics of the scaled dataset.
| | min | 25th percentile | median | mean | 75th percentile | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0 | 0.38 | 0.60 | 0.57 | 0.80 | 1 | 0.27 |
| last_evaluation | 0 | 0.31 | 0.56 | 0.56 | 0.80 | 1 | 0.27 |
| number_project | 0 | 0.20 | 0.40 | 0.36 | 0.60 | 1 | 0.25 |
| average_montly_hours | 0 | 0.28 | 0.49 | 0.49 | 0.70 | 1 | 0.23 |
| time_spend_company | 0 | 0.12 | 0.12 | 0.19 | 0.25 | 1 | 0.18 |
| Work_accident | 0 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 0 | 0.00 | 0.50 | 0.30 | 0.50 | 1 | 0.32 |
| sales | 0 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
1.3 Check Correlations
The simplest way to take a first look at a dataset is to check the correlations. By doing this, we can easily see which factors have a high positive or negative correlation with employees leaving. Correlation is different from causation, so we cannot conclude that a highly correlated factor (independent variable) causes an employee to leave (dependent variable). Also, if some of the factors (independent variables) are highly correlated with each other, we could consider grouping those attributes together.
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_level | sales | accounting | hr | technical | support | management | IT | product_mng | marketing | RandD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 1.00 | 0.11 | -0.14 | -0.02 | -0.10 | 0.06 | -0.39 | 0.03 | 0.05 | 0.00 | -0.03 | -0.01 | -0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| last_evaluation | 0.11 | 1.00 | 0.35 | 0.34 | 0.13 | -0.01 | 0.01 | -0.01 | -0.01 | -0.02 | 0.00 | -0.01 | 0.01 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | -0.01 |
| number_project | -0.14 | 0.35 | 1.00 | 0.42 | 0.20 | 0.00 | 0.02 | -0.01 | 0.00 | -0.01 | 0.00 | -0.03 | 0.03 | 0.00 | 0.01 | 0.00 | 0.00 | -0.02 | 0.01 |
| average_montly_hours | -0.02 | 0.34 | 0.42 | 1.00 | 0.13 | -0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | 0.00 |
| time_spend_company | -0.10 | 0.13 | 0.20 | 0.13 | 1.00 | 0.00 | 0.14 | 0.07 | 0.05 | 0.02 | 0.00 | -0.02 | -0.03 | -0.03 | 0.12 | -0.01 | 0.00 | 0.01 | -0.02 |
| Work_accident | 0.06 | -0.01 | 0.00 | -0.01 | 0.00 | 1.00 | -0.15 | 0.04 | 0.01 | 0.00 | -0.01 | -0.02 | -0.01 | 0.01 | 0.01 | -0.01 | 0.00 | 0.01 | 0.02 |
| left | -0.39 | 0.01 | 0.02 | 0.07 | 0.14 | -0.15 | 1.00 | -0.06 | -0.16 | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 | -0.05 | -0.01 | -0.01 | 0.00 | -0.05 |
| promotion_last_5years | 0.03 | -0.01 | -0.01 | 0.00 | 0.07 | 0.04 | -0.06 | 1.00 | 0.10 | 0.01 | 0.00 | 0.00 | -0.04 | -0.04 | 0.13 | -0.04 | -0.04 | 0.05 | 0.02 |
| salary_level | 0.05 | -0.01 | 0.00 | 0.00 | 0.05 | 0.01 | -0.16 | 0.10 | 1.00 | -0.04 | 0.01 | 0.00 | -0.02 | -0.03 | 0.16 | -0.01 | -0.01 | 0.01 | 0.00 |
| sales | 0.00 | -0.02 | -0.01 | 0.00 | 0.02 | 0.00 | 0.01 | 0.01 | -0.04 | 1.00 | -0.14 | -0.14 | -0.29 | -0.26 | -0.13 | -0.18 | -0.16 | -0.15 | -0.15 |
| accounting | -0.03 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.02 | 0.00 | 0.01 | -0.14 | 1.00 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| hr | -0.01 | -0.01 | -0.03 | -0.01 | -0.02 | -0.02 | 0.03 | 0.00 | 0.00 | -0.14 | -0.05 | 1.00 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| technical | -0.01 | 0.01 | 0.03 | 0.01 | -0.03 | -0.01 | 0.02 | -0.04 | -0.02 | -0.29 | -0.11 | -0.11 | 1.00 | -0.20 | -0.10 | -0.14 | -0.12 | -0.12 | -0.11 |
| support | 0.01 | 0.02 | 0.00 | 0.00 | -0.03 | 0.01 | 0.01 | -0.04 | -0.03 | -0.26 | -0.10 | -0.10 | -0.20 | 1.00 | -0.09 | -0.12 | -0.11 | -0.10 | -0.10 |
| management | 0.01 | 0.01 | 0.01 | 0.00 | 0.12 | 0.01 | -0.05 | 0.13 | 0.16 | -0.13 | -0.05 | -0.05 | -0.10 | -0.09 | 1.00 | -0.06 | -0.05 | -0.05 | -0.05 |
| IT | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | -0.01 | -0.04 | -0.01 | -0.18 | -0.07 | -0.07 | -0.14 | -0.12 | -0.06 | 1.00 | -0.08 | -0.07 | -0.07 |
| product_mng | 0.01 | 0.00 | 0.00 | -0.01 | 0.00 | 0.00 | -0.01 | -0.04 | -0.01 | -0.16 | -0.06 | -0.06 | -0.12 | -0.11 | -0.05 | -0.08 | 1.00 | -0.06 | -0.06 |
| marketing | 0.01 | 0.00 | -0.02 | -0.01 | 0.01 | 0.01 | 0.00 | 0.05 | 0.01 | -0.15 | -0.06 | -0.06 | -0.12 | -0.10 | -0.05 | -0.07 | -0.06 | 1.00 | -0.06 |
| RandD | 0.01 | -0.01 | 0.01 | 0.00 | -0.02 | 0.02 | -0.05 | 0.02 | 0.00 | -0.15 | -0.05 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | 1.00 |
Satisfaction level is the factor most strongly negatively correlated with employees leaving (-0.39).
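The matrix above is simply `cor()` applied to the scaled data. A self-contained toy version with synthetic data (illustrative only, not the real dataset) reproduces the same kind of negative relationship:

```r
# Synthetic sketch: employees with low satisfaction are more likely to
# leave, so cor() between the two columns comes out negative.
set.seed(1)
satisfaction <- runif(100)
left <- ifelse(satisfaction + rnorm(100, sd = 0.3) < 0.4, 1, 0)
round(cor(cbind(satisfaction, left)), 2)
```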
2.1 Select segmentation variables and methods
We use all the variables except “Whether the employee has left.” We use Euclidean distance.
segmentation_attributes_used = c(1:6, 8:19)  # all variables except left (column 7)
profile_attributes_used = c(1:19)            # profile segments on all variables
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"
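The parameters above imply a standard `dist()` / `hclust()` / `cutree()` pipeline. Here is a minimal sketch on toy scaled data (the real report runs this on the full HR matrix):

```r
# Toy illustration of the segmentation pipeline configured above.
set.seed(42)
toy <- matrix(runif(60), ncol = 3)      # 20 "employees", 3 scaled variables
d <- dist(toy, method = "euclidean")    # pairwise Euclidean distances
fit <- hclust(d, method = "ward.D")     # hierarchical clustering, Ward's method
memberships <- cutree(fit, k = 5)       # the 5-segment solution
table(memberships)                      # segment sizes
```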
Here are the pairwise distances between the first 10 observations, using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | |||||||||
| Obs.02 | 1.21 | 0.00 | ||||||||
| Obs.03 | 1.39 | 0.89 | 0.00 | |||||||
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | ||||||
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | |||||
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | ||||
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | |||
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | ||
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
2.2 Visualize Pair-wise Distances
We can see the histogram of, say, the first two variables,
or the histogram of all pairwise distances for the Euclidean distance:
2.3 Number of Segments
Let’s use hierarchical clustering. It may be useful to see the dendrogram, to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the smaller, lower-level clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.
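These merge heights are stored in the `$height` component of the `hclust` object. A toy sketch (rebuilding a small illustrative clustering, not the real data):

```r
# The "distances traveled" are the merge heights of the dendrogram.
set.seed(42)
fit <- hclust(dist(matrix(runif(60), ncol = 3)), method = "ward.D")
heights <- rev(fit$height)   # n - 1 heights, largest merges first
head(heights, 5)             # in the report, the first 20 would be plotted
```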
For now let’s consider the 5-segment solution. We can also see the segment each observation (each employee in this case) belongs to; here are the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 1 |
| 19 | 2 |
| 20 | 1 |
2.4 Profile and interpret the segments
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.
The average values of our data for the total population as well as within each employee segment are:
| Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | |
|---|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.57 | 0.57 | 0.57 | 0.58 | 0.58 |
| last_evaluation | 0.56 | 0.55 | 0.56 | 0.56 | 0.57 | 0.56 |
| number_project | 0.36 | 0.36 | 0.36 | 0.38 | 0.36 | 0.36 |
| average_montly_hours | 0.49 | 0.49 | 0.49 | 0.50 | 0.49 | 0.50 |
| time_spend_company | 0.19 | 0.19 | 0.20 | 0.18 | 0.17 | 0.18 |
| Work_accident | 0.14 | 0.14 | 0.15 | 0.14 | 0.15 | 0.13 |
| left | 0.24 | 0.25 | 0.22 | 0.26 | 0.25 | 0.22 |
| promotion_last_5years | 0.02 | 0.00 | 0.07 | 0.00 | 0.00 | 0.00 |
| salary_level | 0.30 | 0.27 | 0.34 | 0.28 | 0.27 | 0.29 |
| sales | 0.28 | 1.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| accounting | 0.05 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 |
| hr | 0.05 | 0.00 | 0.15 | 0.00 | 0.00 | 0.00 |
| technical | 0.18 | 0.00 | 0.01 | 1.00 | 0.00 | 0.00 |
| support | 0.15 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| management | 0.04 | 0.00 | 0.13 | 0.00 | 0.00 | 0.00 |
| IT | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| product_mng | 0.06 | 0.00 | 0.19 | 0.00 | 0.00 | 0.00 |
| marketing | 0.06 | 0.00 | 0.18 | 0.00 | 0.00 | 0.00 |
| RandD | 0.05 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 |
We can measure the ratio of the average for each cluster to the average of the population and subtract 1 (i.e. avg(cluster) / avg(population) - 1), and explore a matrix such as the following:
| Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | |
|---|---|---|---|---|---|
| satisfaction_level | 0.00 | 0.00 | -0.01 | 0.01 | 0.01 |
| last_evaluation | -0.02 | 0.00 | 0.01 | 0.02 | 0.00 |
| number_project | -0.02 | -0.01 | 0.04 | 0.00 | 0.01 |
| average_montly_hours | 0.00 | -0.01 | 0.01 | 0.00 | 0.01 |
| time_spend_company | 0.02 | 0.05 | -0.06 | -0.07 | -0.02 |
| Work_accident | -0.04 | 0.04 | -0.03 | 0.07 | -0.08 |
| left | 0.05 | -0.09 | 0.08 | 0.05 | -0.07 |
| promotion_last_5years | -1.00 | 2.08 | -1.00 | -1.00 | -0.89 |
| salary_level | -0.08 | 0.13 | -0.04 | -0.08 | -0.04 |
| sales | 2.62 | -0.93 | -1.00 | -1.00 | -1.00 |
| accounting | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| hr | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| technical | -1.00 | -0.97 | 4.51 | -1.00 | -1.00 |
| support | -1.00 | -0.97 | -1.00 | 5.73 | -1.00 |
| management | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| IT | -1.00 | -1.00 | -1.00 | -1.00 | 11.22 |
| product_mng | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| marketing | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
| RandD | -1.00 | 2.10 | -1.00 | -1.00 | -1.00 |
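The entries above follow directly from the formula avg(segment) / avg(population) - 1. A sketch with illustrative numbers (not the exact segment averages):

```r
# Sketch of the profiling ratio; the values here are illustrative.
population_avg <- c(satisfaction_level = 0.57, left = 0.24)
segment_avg    <- c(satisfaction_level = 0.36, left = 0.78)
round(segment_avg / population_avg - 1, 2)
```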
The segment profiles appear to depend too heavily on department: the binary department variables dominate the clustering.
Let’s try the analysis again, excluding the department information.
We use all the variables except “Whether the employee has left” and the department dummies. We use Euclidean distance.
segmentation_attributes_used = c(1:6, 9)  # as run, this keeps columns 1-6 and salary_level (9), dropping left (7), promotion_last_5years (8), and the department dummies (10-19)
profile_attributes_used = c(1:9)          # profile on all non-department variables
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"
Here are the pairwise distances between the first 10 observations, using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | |||||||||
| Obs.02 | 1.21 | 0.00 | ||||||||
| Obs.03 | 1.39 | 0.89 | 0.00 | |||||||
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | ||||||
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | |||||
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | ||||
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | |||
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | ||
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
We can see the histogram of, say, the first two variables,
or the histogram of all pairwise distances for the Euclidean distance:
Let’s use hierarchical clustering. It may be useful to see the dendrogram, to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the smaller, lower-level clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. If we have n observations, this plot has n-1 numbers; we show the first 20 here.
For now let’s consider the 5-segment solution. We can also see the segment each observation (each employee in this case) belongs to; here are the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 1 |
| 6 | 1 |
| 7 | 3 |
| 8 | 4 |
| 9 | 4 |
| 10 | 1 |
| 11 | 1 |
| 12 | 3 |
| 13 | 4 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 4 |
| 19 | 5 |
| 20 | 4 |
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are and interpret the segments.
The average values of our data for the total population as well as within each employee segment are:
| Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | |
|---|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.36 | 0.70 | 0.08 | 0.69 | 0.61 |
| last_evaluation | 0.56 | 0.25 | 0.59 | 0.70 | 0.60 | 0.55 |
| number_project | 0.36 | 0.04 | 0.36 | 0.70 | 0.36 | 0.36 |
| average_montly_hours | 0.49 | 0.24 | 0.50 | 0.69 | 0.51 | 0.49 |
| time_spend_company | 0.19 | 0.13 | 0.20 | 0.28 | 0.16 | 0.19 |
| Work_accident | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| left | 0.24 | 0.78 | 0.09 | 0.55 | 0.15 | 0.08 |
| promotion_last_5years | 0.02 | 0.01 | 0.04 | 0.01 | 0.01 | 0.04 |
| salary_level | 0.30 | 0.25 | 0.58 | 0.27 | 0.00 | 0.30 |
We can measure the ratio of the average for each cluster to the average of the population and subtract 1 (i.e. avg(cluster) / avg(population) - 1), and explore a matrix such as the following:
| Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | |
|---|---|---|---|---|---|
| satisfaction_level | -0.38 | 0.22 | -0.86 | 0.20 | 0.07 |
| last_evaluation | -0.55 | 0.05 | 0.25 | 0.07 | -0.01 |
| number_project | -0.90 | 0.00 | 0.94 | 0.01 | -0.01 |
| average_montly_hours | -0.51 | 0.03 | 0.40 | 0.03 | -0.01 |
| time_spend_company | -0.31 | 0.09 | 0.48 | -0.16 | 0.01 |
| Work_accident | -1.00 | -1.00 | -1.00 | -1.00 | 5.92 |
| left | 2.29 | -0.64 | 1.33 | -0.38 | -0.67 |
| promotion_last_5years | -0.37 | 0.69 | -0.72 | -0.69 | 0.65 |
| salary_level | -0.16 | 0.94 | -0.08 | -1.00 | 0.02 |
Following the analysis above, several business decisions can be made.
First, companies can implement policies to control attrition by managing the variables that are highly correlated with employees leaving.